TheSys - A comprehensive thesaurus system for intelligent document analysis and text retrieval
Authors
Abstract
Well-designed thesauri can represent semantic/conceptual knowledge so as to reveal relationships among different elements in documents, thus serving as a critical tool in intelligent text retrieval and document analysis systems. In this paper, we present a thesaurus system, referred to as TheSys, which users can employ to build thesauri according to their own requirements. Our goal is to design a comprehensive thesaurus-building tool that can be used in any field of specialty rather than targeting a particular one. People can use our system to build an electronic thesaurus in whatever specialty field a specific application requires. We propose a thesaurus model, referred to as the thesaurus frame, which uses weighted links to represent semantic relationships among concepts and terms. Our approach is to use a set of controlled terms, referred to as semantemes, to build the thesaurus frame. This approach can effectively reduce the size of the thesaurus without compromising its intelligence.
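As a rough illustration of weighted links between terms and a controlled vocabulary, the following Python sketch may help. It is not TheSys itself: the class name `ThesaurusFrame`, its methods, the example semantemes, the weight range of 0 to 1, and the scoring rule (summing products of weights over shared semantemes) are all assumptions made for this example, since the abstract does not specify them.

```python
from collections import defaultdict


class ThesaurusFrame:
    """Toy model (hypothetical): terms linked to semantemes by weighted links."""

    def __init__(self):
        # term -> {semanteme: weight}; the [0, 1] weight range is an assumption
        self.links = defaultdict(dict)

    def add_link(self, term, semanteme, weight):
        """Attach a term to a controlled term (semanteme) with a given weight."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0 and 1")
        self.links[term][semanteme] = weight

    def relatedness(self, term_a, term_b):
        """Assumed scoring rule: sum of weight products over shared semantemes."""
        a = self.links.get(term_a, {})
        b = self.links.get(term_b, {})
        return sum(w * b[s] for s, w in a.items() if s in b)

    def related_terms(self, term):
        """Other terms with a non-zero score, ranked from most to least related."""
        scores = {
            other: self.relatedness(term, other)
            for other in self.links
            if other != term
        }
        return sorted(
            ((t, s) for t, s in scores.items() if s > 0),
            key=lambda pair: pair[1],
            reverse=True,
        )


# Illustrative data only; these terms and semantemes are invented.
frame = ThesaurusFrame()
frame.add_link("car", "VEHICLE", 0.9)
frame.add_link("truck", "VEHICLE", 0.8)
frame.add_link("truck", "CARGO", 0.6)
frame.add_link("parcel", "CARGO", 0.7)

for other, score in frame.related_terms("truck"):
    print(f"{other}: {score:.2f}")
```

Because every term attaches to a small shared set of semantemes rather than directly to every other term, the number of stored links grows with the number of terms instead of the number of term pairs, which is one plausible reading of how a controlled vocabulary keeps a thesaurus compact.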
Similar resources
A method for keyword extraction and word weighting to improve the classification of Persian texts
Due to the ever-increasing expansion of information and the huge volume of existing unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, automated extraction seems inevitable. This research attempts to use a thesaurus (a structured word-net) to extract them automatically. A...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. The system includes preprocessing, document segmentation, feature extraction, and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing it, and detecting and correcting image skew. In document segmentation, an algorith...
Large-Scale Linguistic Ontology as a Basis for Text Categorization of Legislative Documents
The paper describes the structure and properties of a large linguistic ontology, a new kind of information retrieval thesaurus: the Thesaurus on Sociopolitical Life for Conceptual Indexing. The thesaurus is used in various real-scale information-retrieval applications in the legal domain. At present, one of the main applications of the Thesaurus is knowledge-based text categorization. Categories are ...
Deriving Concepts Hierarchy
Information Retrieval (IR) covers the problems of effectively storing, accessing, searching, and locating documents relevant to a user’s information need or query within large document collections. Many techniques and tools have been developed to improve these processes. One of these tools is the thesaurus. This paper will present a tool for users to build thesauri according to the...
Study of Ontology or Thesaurus Based Document Clustering and Information Retrieval
Document clustering generates clusters from the whole document collection automatically and is used in many fields, including data mining and information retrieval. Clustering text data faces a number of new challenges. Among others, the volume of text data, dimensionality, sparsity and complex semantics are the most important ones. These characteristics of text data require clustering techniqu...